"An Introduction to Data Visualization in a Pandemic World" is brought to you by the Centre for the Analysis of Genome Evolution & Function's (CAGEF) bioinformatics training initiative. This talk was developed to introduce participants of the Bioinformatics and Computational Biology Student Union (BCBSU) Biohacks 2021 conference to the world of R by focusing on basic concepts, methods, and packages for formatting and plotting scientific data. While the datasets and examples used in this talk are centred on SARS-CoV-2 epidemiological data, the lessons learned herein can be applied broadly.
By the end of this presentation, students should be able to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) in your data.
The structure of the presentation is a code-along style using Jupyter notebooks. At the start of this presentation, a skeleton version will be provided for use on the University of Toronto Jupyter Hub so students can program along with the presenter.
To reproduce the repository from GitHub on your Jupyter Hub, simply click on this link.
This will be your 1-hour crash course on Jupyter notebooks and R! By the end of this lecture, we will have covered the following topics:
`tidyverse` (grey background) - a package, function, code, command, or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink
Today's datasets will focus on epidemiological data from the Ontario provincial government found here.
This dataset was obtained from the Ontario provincial website and holds statistics regarding SARS-CoV-2 cases throughout different public health units in the province. It is in a comma-separated format and has been collected since 2020-03-24.
This dataset was obtained from the Ontario provincial website and holds statistics regarding SARS-CoV-2 throughout the province. It is in a comma separated format and has been growing/expanding since initial tracking started on 2020-01-26.
If you'd like to code along with this presentation, please begin by clicking on the following link, which will clone a GitHub repository to your personal Jupyter Hub at the University of Toronto.
All of your work with the Jupyter Notebook on the University of Toronto JupyterHub will be contained within a new browser tab, with the address bar showing something like
https://jupyter.utoronto.ca/user/assigned-username-hexadecimal/tree/BCBSU_Biohacks_2021
All of this is running remotely on a University of Toronto server rather than your own machine.
You'll see a directory structure from your home folder:
i.e. /BCBSU_Biohacks_2021/. Clicking on that, you'll find Intro_R_dataViz.skeleton.ipynb, which is the notebook we will use for today's code-along talk.
This presentation has been implemented on this platform to reduce the burden of having to install various programs. While installation can be a little tricky, it's really not that bad. For this introductory talk, however, you don't need to go through all of that just to learn the basics of coding.
Jupyter Notebooks also give us the option of inserting "markdown" text, much like what you're reading at this very moment. So we can intersperse ideas and information between our learning code blocks.
There is, however, an appendix section at the end of this lecture detailing how to install Jupyter Notebooks (and the R kernel for them) as well as independent installation of R itself and a great integrated development environment (IDE) called RStudio.
Behind the scenes of each Jupyter notebook a programming kernel is running. For instance, depending on the setup, our notebooks can run a true or "emulated" R-kernel to interpret each code cell as if it were written specifically for the R language.
As we move from code cell to new code cell, all of the variables or objects we have created are stored within memory. We can refer to these as we run the code and move forward, but if you overwrite or change them by mistake, you may have to rerun multiple cell blocks!
There are some options in the "Cell" menu that can alleviate these problems such as "Run All Above". If you think you've made a big error by overwriting a key object, you can use that option to "re-initialize" all of your previous code!
Remember these friendly keys/shortcuts:
- Esc to enter "Command Mode", which basically takes you outside of the cell
- Enter to edit a cell
- Arrow keys to navigate up and down (and within a cell)
- Ctrl+Enter to run a cell (both code and markdown)
- Shift+Enter to run the current cell and move to the next one below
- Ctrl+/ to quickly comment and uncomment single or multiple lines of code

In Command Mode:

- a inserts a new cell above the currently selected cell
- b inserts a new cell below the currently selected cell
- m converts a cell to a markdown cell
- y converts a cell to a code cell
- # inside a code cell comments out your code
- r converts a cell to a raw nbconvert cell. This is most helpful when wishing to preserve a code format without running it through the kernel.

Depending on your needs, you may find yourself doing the following:
Jupyter allows you to alternate between "markdown" notes and "code" that can be run or re-run on the fly.
Each data run and its results can be saved individually as a new notebook, or as new cells, to compare data and small changes in your analyses!
So... what is in these packages? A package can be a collection of:
Functions are the basic workhorses of R; they are the tools we use to analyze our data. Each function can be thought of as a unit that has a specific task. A function takes an input, evaluates it using an expression (e.g. a calculation, plot, merge, etc.), and returns an output (a single value, multiple values, a graphic, etc.).
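As a concrete sketch of that input → expression → output pattern, here is a small, hypothetical function of our own (not part of today's dataset or packages):

```r
# A hypothetical function: takes an input, evaluates an expression, returns an output
fahrenheit_to_celsius <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}

fahrenheit_to_celsius(212)  # returns 100
```

Calling the function by name with an input in parentheses evaluates the expression inside the braces and returns the result.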
In this course we will rely a lot on a package called tidyverse which is also dependent upon a series of other packages.
repr - a package useful for altering some of the attributes of objects related to the R kernel.
tidyverse, which includes a number of packages such as dplyr, tidyr, stringr, forcats, and ggplot2
viridis helps to create color-blind palettes for our data visualizations
lubridate and zoo are helper packages used for working with date formats in R
Let's run our first code cell!
# Packages to help tidy our data
library(tidyverse)
# Packages for the graphical analysis section
library(repr)
library(viridis)
# Packages used for working with/formatting dates in R
library(lubridate)
library(zoo)
-- Attaching packages --------------------------------------- tidyverse 1.3.0 -- v ggplot2 3.3.2 v purrr 0.3.4 v tibble 3.0.4 v dplyr 1.0.2 v tidyr 1.1.2 v stringr 1.4.0 v readr 1.4.0 v forcats 0.5.0 -- Conflicts ------------------------------------------ tidyverse_conflicts() -- x dplyr::filter() masks stats::filter() x dplyr::lag() masks stats::lag() Loading required package: viridisLite Attaching package: 'lubridate' The following objects are masked from 'package:base': date, intersect, setdiff, union Attaching package: 'zoo' The following objects are masked from 'package:base': as.Date, as.Date.numeric
There are many tips and tricks to remember about R but here we'll quickly recall some foundation knowledge that we'll need when further into this lesson.
If we want to hold onto a number, calculation, or object we need to assign it to a named variable. R has multiple methods for assigning a value to a variable and an order of precedence!
-> Rightward assignment: we won't really be using this in our course.
<- Leftward assignment: assignment used by most 'authentic' R programmers but really just a historical throwback.
= Leftward assignment: commonly used token for assignment in many other programming languages but be careful as it carries dual meaning!
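A quick sketch of the three assignment tokens side by side (the variable names here are arbitrary):

```r
x <- 10   # leftward assignment: the conventional R style
10 -> y   # rightward assignment: legal, but rarely seen
z = 10    # also assigns at the top level, but `=` is reserved for named arguments inside function calls
c(x, y, z)  # all three variables now hold 10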
What do I mean by 'types' of data?
- character: "a", "aa", or "@c#o0*"
- double: 7.5
- integer: 1
- logical: TRUE or FALSE

The job of data structures is to "host" the different data types. There are five types of data structures in R:
Each vector has a type and a length, reported by typeof() and length(). Vector elements are indexed from 1 to length(your_vector) and can be accessed with [].

# Build a character vector
char.vector <- c("Canada", "United States", "Great Britain")
char.vector
# subset by a single value
char.vector[2]
# subset by multiple values
char.vector[2:3]
# subset by removing values (cannot be mixed with positive values)
char.vector[c(-1, -3)]
# subset with repeating multiple values
char.vector[c(1, 2, 3, 3, 2, 1)]
# Build a character vector but include variable names
character.vector <- c(a = "Canada", b = "United States", c = "Great Britain")
character.vector
# subset by element name
character.vector[c("a", "c")]
# subset by a vector of logicals
character.vector[c(FALSE, TRUE, TRUE)]
character.vector[character.vector != "Canada"]
R will implicitly force (coerce) your vector to be of one data type, in this case the type that is most inclusive is a character vector. When we explicitly coerce a change from one data type to the next, it is known as casting. You can cast between certain data types and also object types.
- Data types: as.logical(), as.integer(), as.double(), as.numeric(), as.character(), and as.factor()
- Object types: as.data.frame(), as.list(), and as.matrix()

Importantly, when coercing, R converts from more specific to more general types, usually in this order: logical → integer → double → character.
# Make a logical vector
logical.vector <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
str(logical.vector)
# Make a numeric vector
numeric.vector <- c(-1:10)
str(numeric.vector)
# Make a mixed vector. Take a note of the type
mixed.vector <- c(FALSE, TRUE, 1, 2, "three", 4, 5, "six")
str(mixed.vector)
logi [1:5] TRUE FALSE TRUE FALSE FALSE
int [1:12] -1 0 1 2 3 4 5 6 7 8 ...
chr [1:8] "FALSE" "TRUE" "1" "2" "three" "4" "5" "six"
# Attempt to coerce our vectors
# logical to numeric
as.numeric(logical.vector)
# numeric to logical
as.logical(numeric.vector)
# numeric to character
as.character(numeric.vector)
# mixed to a numeric. Note what happens when elements cannot be converted
as.numeric(mixed.vector)
Warning message in eval(expr, envir, enclos): "NAs introduced by coercion"
Now that we have had the opportunity to create a few different vector objects, let's talk about what an object class is. An object class can be thought of as a structure with attributes that will behave a certain way when passed to a function. Because of this, the same function can treat different classes of objects differently.
Some R package developers have created their own object classes. For example, many of the functions in the tidyverse generate tibble objects. They behave in most ways like a data frame but have a more refined print structure, making it easier to see information such as column types when viewing them quickly. In general, from a troubleshooting standpoint, it is good to be aware that your data may need to be formatted to fit a certain class of object when using different packages.
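As a quick illustration of the difference, here is a small sketch (assuming the tibble package, which ships with the tidyverse, is available; the toy data is made up):

```r
library(tibble)

# The same toy data as a base data frame and as a tibble
df  <- data.frame(country = c("Canada", "Japan"), cases = c(10L, 20L))
tbl <- as_tibble(df)

class(df)   # "data.frame"
class(tbl)  # "tbl_df" "tbl" "data.frame"
tbl         # the tibble print method shows dimensions and column types (<chr>, <int>)
```

Note that a tibble still inherits from data.frame, which is why most data-frame code works on it unchanged.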
Whereas matrices are 2-dimensional structures limited to a single specific data type within each instance, data frames are more complex, as each column of the structure can be treated like a vector. The data frame, however, can have multiple data types mixed across its different columns. Data frame rules to remember are:
Data frames allow us to generate tables of mixed information much like an Excel spreadsheet.
# Generate a data frame with different variable/column types
mixed.df <- data.frame(country = character.vector,
values = numeric.vector[2:4],
commonwealth = logical.vector[1:3])
str(mixed.df)
'data.frame': 3 obs. of 3 variables:
 $ country     : chr "Canada" "United States" "Great Britain"
 $ values      : int 0 1 2
 $ commonwealth: logi TRUE FALSE TRUE
nrow(data_frame) # retrieve the number of rows in a data frame
ncol(data_frame) # retrieve the number of columns in a data frame
data_frame$column_name # Access a specific column by its name
data_frame[x,y] # Access a specific element located at row x, column y
rownames(data_frame) # retrieve or assign row names to your data frame
colnames(data_frame) # retrieve or assign column names to your data frame
There are many more ways to access and manipulate data frames that we'll explore further down the road. Let's review some basic data frame code.
# query the dimensions of the data frame
dim(mixed.df)
nrow(mixed.df)
ncol(mixed.df)
# row and column names
rownames(mixed.df)
colnames(mixed.df)
# print the mixed data frame
mixed.df
# Access portions of the data frame
# a single column
str(mixed.df$country)
# a single element
mixed.df[2, 3]
mixed.df[3, "country"]
# multiple rows
mixed.df[c(1,3), ]
mixed.df[-2, ]
| | country | values | commonwealth |
|---|---|---|---|
| | <chr> | <int> | <lgl> |
| a | Canada | 0 | TRUE |
| b | United States | 1 | FALSE |
| c | Great Britain | 2 | TRUE |
chr [1:3] "Canada" "United States" "Great Britain"
| | country | values | commonwealth |
|---|---|---|---|
| | <chr> | <int> | <lgl> |
| a | Canada | 0 | TRUE |
| c | Great Britain | 2 | TRUE |

| | country | values | commonwealth |
|---|---|---|---|
| | <chr> | <int> | <lgl> |
| a | Canada | 0 | TRUE |
| c | Great Britain | 2 | TRUE |
Ah, the dreaded factors! A factor is a class of object used to encode a character vector into categories. Factors are used to store categorical variables, and although it is tempting to think of them as character vectors, this is a dangerous mistake.
Factors make perfect sense if you are a statistician designing a programming language (!) but to everyone else they exist solely to torment us with confusing errors. A factor is really just an integer vector with an additional attribute, levels (accessible via levels()), which defines the possible values.
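We can see both faces of a factor, the underlying integer codes and the levels attribute, with a small hypothetical example (made-up province abbreviations):

```r
# A hypothetical factor of province abbreviations
provinces <- factor(c("ON", "QC", "ON", "BC"))

levels(provinces)        # "BC" "ON" "QC": the possible values, sorted alphabetically by default
as.integer(provinces)    # 2 3 2 1: the integer codes pointing into the levels
as.character(provinces)  # "ON" "QC" "ON" "BC": back to the original strings
```

This is why comparing a factor directly to a number, or assuming its print order matches its level order, can produce confusing results.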
Why not just use character vectors, you ask?
Believe it or not, factors do have some useful properties. For example, factors allow you to specify all possible values a variable may take, even if those values are not in your data set. Think of conditional formatting in Excel. We also use them heavily in generating statistical analyses and in grouping data when we want to visualize it.
For more information about factors, check out the appendix!
# Generate a data frame and include factors for all character vectors
str(data.frame(country = character.vector,
values = numeric.vector[2:4],
commonwealth = logical.vector[1:3],
continent = c("North America", "North America", "Europe"),
stringsAsFactors = TRUE)
)
'data.frame': 3 obs. of 4 variables:
 $ country     : Factor w/ 3 levels "Canada","Great Britain",..: 1 3 2
 $ values      : int 0 1 2
 $ commonwealth: logi TRUE FALSE TRUE
 $ continent   : Factor w/ 2 levels "Europe","North America": 2 2 1
# Explicitly define factors for each variable.
str(data.frame(country = factor(character.vector),
values = numeric.vector[2:4],
commonwealth = logical.vector[1:3],
continent = c("North America", "North America", "Europe"),
stringsAsFactors = FALSE)
)
'data.frame': 3 obs. of 4 variables:
 $ country     : Factor w/ 3 levels "Canada","Great Britain",..: 1 3 2
 $ values      : int 0 1 2
 $ commonwealth: logi TRUE FALSE TRUE
 $ continent   : chr "North America" "North America" "Europe"
Missing values in R are handled as NA (Not Available). Impossible values (like the result of 0/0) are represented by NaN (Not a Number). Both can be considered null values, and both, especially NA, need to be handled in special ways; otherwise they may lead to errors or unexpected results from functions.
For our purposes, we are not interested in keeping NA data within our datasets so we will usually detect and remove them or replace them within our data after it is imported.
- is.na() returns a logical vector reporting which values from your query are NA.
- complete.cases() returns a logical for each row without any NA values.
- Many functions can ignore NA values with the na.rm = TRUE parameter: e.g. mean(), sum(), etc.
- The tidyr package can also be used to work with NA values.

# Add some NAs to our data frame
mixed.df <- data.frame(country = character.vector,
values = c(3, NA, 9),
commonwealth = logical.vector[1:3],
continent = c("North America", "North America", "Europe"),
measure = c("metric", NA, "metric")
)
# Output the data frame to look at
mixed.df
| | country | values | commonwealth | continent | measure |
|---|---|---|---|---|---|
| | <chr> | <dbl> | <lgl> | <chr> | <chr> |
| a | Canada | 3 | TRUE | North America | metric |
| b | United States | NA | FALSE | North America | NA |
| c | Great Britain | 9 | TRUE | Europe | metric |
# Which entries are NA?
is.na(mixed.df)
# Which rows are incomplete?
complete.cases(mixed.df)
# Use some math functions
sum(mixed.df$values, na.rm=TRUE)
| | country | values | commonwealth | continent | measure |
|---|---|---|---|---|---|
| a | FALSE | FALSE | FALSE | FALSE | FALSE |
| b | FALSE | TRUE | FALSE | FALSE | TRUE |
| c | FALSE | FALSE | FALSE | FALSE | FALSE |
tidyverse
Let's begin with some definitions:
In data science, long format is preferred over wide format because it allows for easier and more efficient subsetting and manipulation of the data. To read more about wide and long formats, visit here.
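To make the distinction concrete, here is the same hypothetical toy dataset (made-up case counts) laid out both ways:

```r
# Wide format: one row per date, one column per city
wide.df <- data.frame(date    = c("2021-03-01", "2021-03-02"),
                      Toronto = c(300, 280),
                      Ottawa  = c(50, 45))

# Long format: one row per single observation (date x city)
long.df <- data.frame(date      = rep(c("2021-03-01", "2021-03-02"), each = 2),
                      city      = rep(c("Toronto", "Ottawa"), times = 2),
                      new_cases = c(300, 50, 280, 45))
```

Both hold the same values, but in long format each row is a self-contained observation, which is what most tidyverse and ggplot2 functions expect.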
Why tidy data?
Data cleaning (or dealing with 'messy' data) accounts for a huge chunk of a data scientist's time. Ultimately, we want to get our data into a 'tidy' (long) format where it is easy to manipulate, model, and visualize. Having a consistent data structure, and tools that work with that structure, can help this process along.
Tidy data has:
This seems pretty straightforward, and it is. It is the datasets you get that will not be straightforward. Having a map of where to take your data is helpful for unraveling its structure and getting it into a usable format.
The readr package - "All roads lead to Rome..."
...but not all roads are easy to travel.
Depending on format, data files can be opened in a number of ways. The simplest methods we will use involve the readr package as part of the tidyverse. These functions have already been developed to simplify the import process for users. The functions we will use most often are:
- read_delim(), read_csv(), read_tsv(), read_csv2() [European datasets]
- read_lines()

Remember: to learn more about a function, you can type its name prefixed with ? into the console, e.g. ?read_csv, and it will bring up a help page for that function!
Let's read in a dataset that we can convert from wide to long format.
# ?read_csv
covid_phu.df <- read_csv("./data/Ontario_daily_change_in_cases_by_phu.csv")
-- Column specification -------------------------------------------------------- cols( .default = col_double(), Date = col_date(format = "") ) i Use `spec()` for the full column specifications.
# Check the structure and characteristics of covid_phu
str(covid_phu.df)
tail(covid_phu.df)
tibble [354 x 36] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ Date : Date[1:354], format: "2020-03-24" "2020-03-25" ... $ Algoma_Public_Health_Unit : num [1:354] NA 0 0 0 NA NA 3 0 1 0 ... $ Brant_County_Health_Unit : num [1:354] NA 1 0 0 NA NA 9 3 1 5 ... $ Chatham-Kent_Health_Unit : num [1:354] NA 0 0 0 NA NA 3 0 2 2 ... $ Durham_Region_Health_Department : num [1:354] NA 3 1 5 NA NA 56 21 24 25 ... $ Eastern_Ontario_Health_Unit : num [1:354] NA 0 0 0 NA NA 5 1 8 6 ... $ Grey_Bruce_Health_Unit : num [1:354] NA 1 0 1 NA NA 5 1 1 1 ... $ Haldimand-Norfolk_Health_Unit : num [1:354] NA 0 0 0 NA NA 3 4 15 10 ... $ Haliburton,_Kawartha,_Pine_Ridge_District_Health_Unit : num [1:354] NA 0 1 14 NA NA 12 10 8 9 ... $ Halton_Region_Health_Department : num [1:354] NA 1 4 1 NA NA 8 7 27 18 ... $ Hamilton_Public_Health_Services : num [1:354] NA 3 4 1 NA NA 38 17 7 20 ... $ Hastings_and_Prince_Edward_Counties_Health_Unit : num [1:354] NA 0 2 0 NA NA 3 0 1 4 ... $ Huron_Perth_District_Health_Unit : num [1:354] NA 0 0 0 NA NA 5 1 3 3 ... $ Kingston,_Frontenac_and_Lennox_&_Addington_Public_Health: num [1:354] NA 3 5 0 NA NA 12 8 11 4 ... $ Lambton_Public_Health : num [1:354] NA 0 0 5 NA NA 13 9 10 17 ... $ Leeds,_Grenville_and_Lanark_District_Health_Unit : num [1:354] NA 0 0 0 NA NA 8 8 8 7 ... $ Middlesex-London_Health_Unit : num [1:354] NA 0 2 4 NA NA 20 8 8 22 ... $ Niagara_Region_Public_Health_Department : num [1:354] NA 1 1 2 NA NA 22 10 8 18 ... $ North_Bay_Parry_Sound_District_Health_Unit : num [1:354] NA 0 0 1 NA NA 3 0 0 0 ... $ Northwestern_Health_Unit : num [1:354] NA 0 0 1 NA NA 1 0 0 1 ... $ Ottawa_Public_Health : num [1:354] NA 3 0 5 NA NA 52 9 37 118 ... $ Peel_Public_Health : num [1:354] NA 3 13 15 NA NA 95 42 21 114 ... $ Peterborough_Public_Health : num [1:354] NA 0 2 3 NA NA 18 2 1 9 ... $ Porcupine_Health_Unit : num [1:354] NA 0 3 0 NA NA 6 0 5 3 ... $ Region_of_Waterloo,_Public_Health : num [1:354] NA 2 0 3 NA NA 60 5 13 13 ... 
$ Renfrew_County_and_District_Health_Unit : num [1:354] NA 0 0 0 NA NA 1 1 2 3 ... $ Simcoe_Muskoka_District_Health_Unit : num [1:354] NA 0 1 4 NA NA 25 9 5 8 ... $ Southwestern_Public_Health : num [1:354] NA 0 0 2 NA NA 3 2 2 1 ... $ Sudbury_&_District_Health_Unit : num [1:354] NA 1 0 1 NA NA 1 2 2 3 ... $ Thunder_Bay_District_Health_Unit : num [1:354] NA 0 0 0 NA NA 2 1 1 0 ... $ Timiskaming_Health_Unit : num [1:354] NA 0 1 0 NA NA 0 1 0 0 ... $ Toronto_Public_Health : num [1:354] NA 17 21 22 NA NA 197 4 32 282 ... $ Wellington-Dufferin-Guelph_Public_Health : num [1:354] NA 1 1 0 NA NA 4 18 14 6 ... $ Windsor-Essex_County_Health_Unit : num [1:354] NA 1 2 0 NA NA 20 0 10 37 ... $ York_Region_Public_Health_Services : num [1:354] NA 5 5 34 NA NA 94 16 25 109 ... $ Total : num [1:354] 0 46 69 124 0 0 807 220 313 878 ... - attr(*, "spec")= .. cols( .. Date = col_date(format = ""), .. Algoma_Public_Health_Unit = col_double(), .. Brant_County_Health_Unit = col_double(), .. `Chatham-Kent_Health_Unit` = col_double(), .. Durham_Region_Health_Department = col_double(), .. Eastern_Ontario_Health_Unit = col_double(), .. Grey_Bruce_Health_Unit = col_double(), .. `Haldimand-Norfolk_Health_Unit` = col_double(), .. `Haliburton,_Kawartha,_Pine_Ridge_District_Health_Unit` = col_double(), .. Halton_Region_Health_Department = col_double(), .. Hamilton_Public_Health_Services = col_double(), .. Hastings_and_Prince_Edward_Counties_Health_Unit = col_double(), .. Huron_Perth_District_Health_Unit = col_double(), .. `Kingston,_Frontenac_and_Lennox_&_Addington_Public_Health` = col_double(), .. Lambton_Public_Health = col_double(), .. `Leeds,_Grenville_and_Lanark_District_Health_Unit` = col_double(), .. `Middlesex-London_Health_Unit` = col_double(), .. Niagara_Region_Public_Health_Department = col_double(), .. North_Bay_Parry_Sound_District_Health_Unit = col_double(), .. Northwestern_Health_Unit = col_double(), .. Ottawa_Public_Health = col_double(), .. Peel_Public_Health = col_double(), .. 
Peterborough_Public_Health = col_double(), .. Porcupine_Health_Unit = col_double(), .. `Region_of_Waterloo,_Public_Health` = col_double(), .. Renfrew_County_and_District_Health_Unit = col_double(), .. Simcoe_Muskoka_District_Health_Unit = col_double(), .. Southwestern_Public_Health = col_double(), .. `Sudbury_&_District_Health_Unit` = col_double(), .. Thunder_Bay_District_Health_Unit = col_double(), .. Timiskaming_Health_Unit = col_double(), .. Toronto_Public_Health = col_double(), .. `Wellington-Dufferin-Guelph_Public_Health` = col_double(), .. `Windsor-Essex_County_Health_Unit` = col_double(), .. York_Region_Public_Health_Services = col_double(), .. Total = col_double() .. )
| Date | Algoma_Public_Health_Unit | Brant_County_Health_Unit | Chatham-Kent_Health_Unit | Durham_Region_Health_Department | Eastern_Ontario_Health_Unit | Grey_Bruce_Health_Unit | Haldimand-Norfolk_Health_Unit | Haliburton,_Kawartha,_Pine_Ridge_District_Health_Unit | Halton_Region_Health_Department | ... | Simcoe_Muskoka_District_Health_Unit | Southwestern_Public_Health | Sudbury_&_District_Health_Unit | Thunder_Bay_District_Health_Unit | Timiskaming_Health_Unit | Toronto_Public_Health | Wellington-Dufferin-Guelph_Public_Health | Windsor-Essex_County_Health_Unit | York_Region_Public_Health_Services | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <date> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ... | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 2021-03-07 | 0 | 12 | 5 | 58 | 9 | 2 | 12 | 5 | 39 | ... | 36 | 3 | 34 | 53 | 3 | 329 | 38 | 32 | 116 | 1299 |
| 2021-03-08 | 0 | 20 | 5 | 68 | 15 | 0 | 4 | 4 | 51 | ... | 48 | 4 | 27 | 91 | 1 | 568 | 10 | 46 | 119 | 1631 |
| 2021-03-09 | 0 | 6 | 11 | 25 | 10 | 3 | 3 | 1 | 48 | ... | 30 | 5 | 24 | 39 | 1 | 343 | 10 | 30 | 105 | 1185 |
| 2021-03-10 | 0 | 14 | 9 | 48 | 11 | 1 | 6 | 4 | 48 | ... | 31 | 7 | 13 | 67 | 0 | 428 | 8 | 23 | 149 | 1316 |
| 2021-03-11 | 0 | 7 | 10 | 36 | 18 | 5 | 2 | 5 | 33 | ... | 43 | 6 | 11 | 48 | 0 | 294 | 3 | 39 | 79 | 1092 |
| 2021-03-12 | 1 | 11 | 10 | 35 | 12 | -1 | 6 | 4 | 34 | ... | 43 | 3 | 37 | 52 | 0 | 371 | 19 | 39 | 111 | 1371 |
From looking at our public health unit data, we can see that it begins tracking on 2020-03-24 and goes up until 2021-03-12. In total there are observations for 354 days across 34 public health units. The final column appears to be a running tally of total cases across all PHUs reported on each date.
From the outset, we can see there are some issues with the data set that we'll want to resolve and we'll work through some tidyverse functions in order to do that. First let's quickly review some of the potential problems with our dataset.
- The data is in wide format: we want to collapse the PHU columns into a single new_cases variable for each Date observation. At the same time, we will not collapse Total into that same variable.
- The data contains NA values. Many instances are likely due to no data being collected on those dates. For our purposes, it may be simpler to replace them with a value of 0.

Before we tackle these issues, let's go ahead and review some of the tools at our disposal.
The tidyverse package and its contents make manipulating data easier
While the tidyverse is composed of multiple packages, we will be focused on working with a subset of these: dplyr, tidyr, and stringr.
To save on memory and to help make our code more concise, we should also discuss the use of the %>% symbol. This is a redirection or pipe symbol, similar to | in Unix operating systems, and is used for redirecting output from one function to the input of another. By thoughtfully combining this with other commands, we can alter or query our datasets with ease.
Note that much of the time our code will assume we are piping into the first parameter of a function. We can explicitly pass the redirected data to somewhere else by using the period, i.e. `.`.
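As a small sketch of the pipe in action, reusing the mixed.df data frame built earlier (both filter calls below return the same rows):

```r
library(dplyr)

# Nested call: reads inside-out
filter(mixed.df, commonwealth == TRUE)

# Piped call: each result feeds the next function's first parameter
mixed.df %>%
  filter(commonwealth == TRUE)

# The period marks explicitly where the piped data should go
mixed.df %>% nrow(.)
```

The piped version reads top to bottom in the order the operations happen, which is why longer chains of transformations stay legible.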
dplyr has functions for accessing and altering your data

- arrange()
- count(), tally()
- distinct()
- filter()
- mutate(), transmute()
- select()
- summarize() or summarise()
- group_by(), reversed by ungroup()
- rename() and relocate()

tidyr has additional functions for reshaping our data

- pivot_longer() (previously gather())
- pivot_wider() (previously spread())
- extract()
- separate()
- unite()
- drop_na()
- replace_na()

stringr provides functionality for searching data based on regular expressions

- Count, detect, or pull out matches: str_count(), str_detect(), str_extract() and str_extract_all(), str_match() and str_match_all()
- Remove or split on matches: str_remove() and str_remove_all(); str_split(), str_split_fixed(), and str_split_n()
- Keep or find strings matching a pattern: str_subset() and str_which()
- Convert case of a string: str_to_upper() and str_to_lower()
- Combine strings: str_c(), str_flatten()
- Subset strings: str_sub()

pivot_longer()

Previously you may have used gather() from the tidyr package to melt wide data into a long format. Today we will use an actively developed version of this function called pivot_longer() which, for our purposes, will rely on three parameters:
- data: the data frame (and columns) that we wish to transform.
- names_to: the variable name of the new column to hold the collapsed information from our current columns.
- values_to: the variable name of the values for each observation that we are collapsing down.

We'll be using a series of %>% pipes, so for now we won't save our work to a new object.
# Pivot the data into a long-format set
covid_phu.df %>%
pivot_longer(cols = c(2:35), names_to = "public_health_unit", values_to = "new_cases") %>%
# Just take a quick look at the output.
str()
tibble [12,036 x 4] (S3: tbl_df/tbl/data.frame)
 $ Date              : Date[1:12036], format: "2020-03-24" "2020-03-24" ...
 $ Total             : num [1:12036] 0 0 0 0 0 0 0 0 0 0 ...
 $ public_health_unit: chr [1:12036] "Algoma_Public_Health_Unit" "Brant_County_Health_Unit" "Chatham-Kent_Health_Unit" "Durham_Region_Health_Department" ...
 $ new_cases         : num [1:12036] NA NA NA NA NA NA NA NA NA NA ...
Replace NA values from our data with replace_na()
Our conversion to long format creates 12,036 observations relating a Date to a new_cases value in a specific public_health_unit (or the Total). From the looks of our data, however, we have a number of NA values under our new_cases variable. Let's replace those cases with a value of 0 using replace_na(). This function will need two parameters:
- data: the data frame or vector that it will scan for NA values.
- replace: the value that we will use to replace NA.

# Pivot the data into a long-format set and remove NAs from the value table
covid_phu_long.df <- covid_phu.df %>%
pivot_longer(cols = c(2:35), names_to = "public_health_unit", values_to = "new_cases") %>%
# Change the values of "new_cases" using the mutate function
mutate(new_cases = replace_na(data = .$new_cases, replace = 0))
# review the final table
head(covid_phu_long.df)
| Date | Total | public_health_unit | new_cases |
|---|---|---|---|
| <date> | <dbl> | <chr> | <dbl> |
| 2020-03-24 | 0 | Algoma_Public_Health_Unit | 0 |
| 2020-03-24 | 0 | Brant_County_Health_Unit | 0 |
| 2020-03-24 | 0 | Chatham-Kent_Health_Unit | 0 |
| 2020-03-24 | 0 | Durham_Region_Health_Department | 0 |
| 2020-03-24 | 0 | Eastern_Ontario_Health_Unit | 0 |
| 2020-03-24 | 0 | Grey_Bruce_Health_Unit | 0 |
# Check that we have covered all of the NA values in our data frame by looking for complete cases
#nrow(covid_phu_long.df[complete.cases(covid_phu_long.df),])
covid_phu_long.df %>%
complete.cases() %>% # logical vector of complete cases
sum() # logicals can be summed to get the total number of TRUE cases!
# Take a look at the Public Health Unit names
print(unique(covid_phu_long.df$public_health_unit))
#covid_phu_long.df %>% select(public_health_unit) %>% unique()
[1] "Algoma_Public_Health_Unit" [2] "Brant_County_Health_Unit" [3] "Chatham-Kent_Health_Unit" [4] "Durham_Region_Health_Department" [5] "Eastern_Ontario_Health_Unit" [6] "Grey_Bruce_Health_Unit" [7] "Haldimand-Norfolk_Health_Unit" [8] "Haliburton,_Kawartha,_Pine_Ridge_District_Health_Unit" [9] "Halton_Region_Health_Department" [10] "Hamilton_Public_Health_Services" [11] "Hastings_and_Prince_Edward_Counties_Health_Unit" [12] "Huron_Perth_District_Health_Unit" [13] "Kingston,_Frontenac_and_Lennox_&_Addington_Public_Health" [14] "Lambton_Public_Health" [15] "Leeds,_Grenville_and_Lanark_District_Health_Unit" [16] "Middlesex-London_Health_Unit" [17] "Niagara_Region_Public_Health_Department" [18] "North_Bay_Parry_Sound_District_Health_Unit" [19] "Northwestern_Health_Unit" [20] "Ottawa_Public_Health" [21] "Peel_Public_Health" [22] "Peterborough_Public_Health" [23] "Porcupine_Health_Unit" [24] "Region_of_Waterloo,_Public_Health" [25] "Renfrew_County_and_District_Health_Unit" [26] "Simcoe_Muskoka_District_Health_Unit" [27] "Southwestern_Public_Health" [28] "Sudbury_&_District_Health_Unit" [29] "Thunder_Bay_District_Health_Unit" [30] "Timiskaming_Health_Unit" [31] "Toronto_Public_Health" [32] "Wellington-Dufferin-Guelph_Public_Health" [33] "Windsor-Essex_County_Health_Unit" [34] "York_Region_Public_Health_Services"
str_remove_all()
Looking at our PHU names, we can see that there is a lot of redundancy in our names. We see they end in some form of:
We also see the odd comma here and there, but we'll leave that alone for now.
We have a couple of choices: we can use either str_replace_all(), or a specific version of it, str_remove_all(), which simply replaces a pattern with an empty string. For str_replace_all() we will supply:

- string: a single string or vector of strings.
- pattern: the pattern we wish to search for, in the form of a string or regular expression.
- replacement: the replacement string we wish to use (R's partial argument matching lets us write replace = in the code below).

# Clean up the Public Health Unit names
covid_phu_long.df <- covid_phu_long.df %>%
mutate(public_health_unit =
str_replace_all(string = .$public_health_unit,
pattern = c("(_Public\\w*)|(_Health\\w*)"),
replacement = ""
)
) %>%
mutate(public_health_unit = factor(public_health_unit))
# Take a look at the new set of phu names
print(levels(covid_phu_long.df$public_health_unit))
[1] "Algoma" [2] "Brant_County" [3] "Chatham-Kent" [4] "Durham_Region" [5] "Eastern_Ontario" [6] "Grey_Bruce" [7] "Haldimand-Norfolk" [8] "Haliburton,_Kawartha,_Pine_Ridge_District" [9] "Halton_Region" [10] "Hamilton" [11] "Hastings_and_Prince_Edward_Counties" [12] "Huron_Perth_District" [13] "Kingston,_Frontenac_and_Lennox_&_Addington" [14] "Lambton" [15] "Leeds,_Grenville_and_Lanark_District" [16] "Middlesex-London" [17] "Niagara_Region" [18] "North_Bay_Parry_Sound_District" [19] "Northwestern" [20] "Ottawa" [21] "Peel" [22] "Peterborough" [23] "Porcupine" [24] "Region_of_Waterloo," [25] "Renfrew_County_and_District" [26] "Simcoe_Muskoka_District" [27] "Southwestern" [28] "Sudbury_&_District" [29] "Thunder_Bay_District" [30] "Timiskaming" [31] "Toronto" [32] "Wellington-Dufferin-Guelph" [33] "Windsor-Essex_County" [34] "York_Region"
# Take a quick look at our final dataset
head(covid_phu_long.df)
| Date | Total | public_health_unit | new_cases |
|---|---|---|---|
| <date> | <dbl> | <fct> | <dbl> |
| 2020-03-24 | 0 | Algoma | 0 |
| 2020-03-24 | 0 | Brant_County | 0 |
| 2020-03-24 | 0 | Chatham-Kent | 0 |
| 2020-03-24 | 0 | Durham_Region | 0 |
| 2020-03-24 | 0 | Eastern_Ontario | 0 |
| 2020-03-24 | 0 | Grey_Bruce | 0 |
rename() variables for clarity

Now that we have the basic structure for our data, we want to clean it up a little by renaming our Total column to clarify that it represents total new cases across all PHUs for that date. Why did we keep this column separate? We can use this information to generate percentage totals for each PHU if we choose to.
We'll use rename() from dplyr to accomplish the task of renaming our column. There are a number of ways you could accomplish this without using dplyr but the simplicity of it is nice.
# Rename our Total column to clarify its meaning
covid_phu_long.df %>%
rename(total_phu_new = Total,
date = Date) %>%
head()
| date | total_phu_new | public_health_unit | new_cases |
|---|---|---|---|
| <date> | <dbl> | <fct> | <dbl> |
| 2020-03-24 | 0 | Algoma | 0 |
| 2020-03-24 | 0 | Brant_County | 0 |
| 2020-03-24 | 0 | Chatham-Kent | 0 |
| 2020-03-24 | 0 | Durham_Region | 0 |
| 2020-03-24 | 0 | Eastern_Ontario | 0 |
| 2020-03-24 | 0 | Grey_Bruce | 0 |
relocate()

The last cleanup step is to move total_phu_new to the last column of our data frame. This is personal preference, but it also makes more sense when simply looking at the data. The relocate() verb from dplyr accomplishes this with ease since we are not dropping or removing columns. It uses some extra syntax:

- .data: the data frame or tibble we want to alter.
- ...: the columns we wish to move.
- .before or .after: determines the destination of the columns. Supplying neither will move columns to the left-hand side.

In fact, relocate() can be used to rename a column as well, but the column will also be moved by default, so consider the ramifications of such an action!
# Rename our Total column to clarify its meaning
covid_phu_long.df <- covid_phu_long.df %>%
rename(total_phu_new = Total,
date = Date) %>%
# relocate our total column to the right side
relocate(total_phu_new, .after = new_cases)
head(covid_phu_long.df)
| date | public_health_unit | new_cases | total_phu_new |
|---|---|---|---|
| <date> | <fct> | <dbl> | <dbl> |
| 2020-03-24 | Algoma | 0 | 0 |
| 2020-03-24 | Brant_County | 0 | 0 |
| 2020-03-24 | Chatham-Kent | 0 | 0 |
| 2020-03-24 | Durham_Region | 0 | 0 |
| 2020-03-24 | Eastern_Ontario | 0 | 0 |
| 2020-03-24 | Grey_Bruce | 0 | 0 |
At this point we have completed the data wrangling we want to accomplish on this dataset. We've converted it to a long format, renamed the PHU entries, and removed any NA values that may cause issues. There are a number of ways we could save this data now, either as a text file or in its current form as a data frame in an .RData file.
- write_delim(), write_csv(), write_tsv(), write_excel_csv()
- write_lines()
- save()
- load()

Let's try some of those methods now.
# Check the files names we currently have
print(dir("./data/"))
[1] "Ontario_covidtesting.csv" [2] "Ontario_daily_change_in_cases_by_phu.csv" [3] "Ontario_daily_change_in_cases_by_phu_long.RData" [4] "Ontario_daily_change_in_cases_by_phu_long.tsv"
# Write covid_phu_long.df to a tab-delimited file
write_tsv(covid_phu_long.df, file = "./data/Ontario_daily_change_in_cases_by_phu_long.tsv")
# Check our file names after writing
print(dir("./data/"))
[1] "Ontario_covidtesting.csv" [2] "Ontario_daily_change_in_cases_by_phu.csv" [3] "Ontario_daily_change_in_cases_by_phu_long.RData" [4] "Ontario_daily_change_in_cases_by_phu_long.tsv"
# Save our data frame as an object
save(covid_phu_long.df, file="./data/Ontario_daily_change_in_cases_by_phu_long.RData")
# Check our file names after saving
print(dir("./data/"))
[1] "Ontario_covidtesting.csv" [2] "Ontario_daily_change_in_cases_by_phu.csv" [3] "Ontario_daily_change_in_cases_by_phu_long.RData" [4] "Ontario_daily_change_in_cases_by_phu_long.tsv"
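Since load() appears in our list of options but we haven't demonstrated it, here is a quick sketch of restoring the object we just saved. Note that load() recreates the object under the name it was saved with, covid_phu_long.df, rather than returning a value you assign:

```r
# Restore the saved data frame, e.g. in a fresh session.
# load() recreates the object under its original name: covid_phu_long.df
load(file = "./data/Ontario_daily_change_in_cases_by_phu_long.RData")
head(covid_phu_long.df)
```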
readxl and writexl for working with Excel spreadsheets

Not all of your data may come in a comma- or tab-delimited format. For Excel spreadsheets, there are packages available that can facilitate the parsing of these more complex files. The readxl package is part of the tidyverse, but the writexl package is not. There are other means of writing to an Excel file format, but they depend on other programs or drivers.
From the readxl package
- excel_sheets()
- read_excel()

From the writexl package (not part of the tidyverse, but independent of Java and Excel)

- write_xlsx()

ggplot2

We now have some data in a tidy format that we'd like to visualize. We can begin with some initial analyses using the ggplot2 package. It has all of the components we need to help us decide which data we want to focus on or keep. There are a number of ways to visualize our data, and here we will refresh our ggplot skills.
Basic ggplot notes:
- ggplot objects hold a complex number of attributes but always need an initial source of data
- ggplot objects can be modified with the + symbol by adding in layers
- ggplot objects can be plotted, saved, and passed around

# Adjust our plot window sizes for us
options(repr.plot.width=21, repr.plot.height=7)
# Initialize a plot with our data
phu.plot <- ggplot(covid_phu_long.df)
# Take a quick look at the structure of the data
str(phu.plot)
List of 9
$ data : tibble [12,036 x 4] (S3: tbl_df/tbl/data.frame)
..$ date : Date[1:12036], format: "2020-03-24" "2020-03-24" ...
..$ public_health_unit: Factor w/ 34 levels "Algoma","Brant_County",..: 1 2 3 4 5 6 7 8 9 10 ...
..$ new_cases : num [1:12036] 0 0 0 0 0 0 0 0 0 0 ...
..$ total_phu_new : num [1:12036] 0 0 0 0 0 0 0 0 0 0 ...
$ layers : list()
$ scales :Classes 'ScalesList', 'ggproto', 'gg' <ggproto object: Class ScalesList, gg>
add: function
clone: function
find: function
get_scales: function
has_scale: function
input: function
n: function
non_position_scales: function
scales: NULL
super: <ggproto object: Class ScalesList, gg>
$ mapping : Named list()
..- attr(*, "class")= chr "uneval"
$ theme : list()
$ coordinates:Classes 'CoordCartesian', 'Coord', 'ggproto', 'gg' <ggproto object: Class CoordCartesian, Coord, gg>
aspect: function
backtransform_range: function
clip: on
default: TRUE
distance: function
expand: TRUE
is_free: function
is_linear: function
labels: function
limits: list
modify_scales: function
range: function
render_axis_h: function
render_axis_v: function
render_bg: function
render_fg: function
setup_data: function
setup_layout: function
setup_panel_guides: function
setup_panel_params: function
setup_params: function
train_panel_guides: function
transform: function
super: <ggproto object: Class CoordCartesian, Coord, gg>
$ facet :Classes 'FacetNull', 'Facet', 'ggproto', 'gg' <ggproto object: Class FacetNull, Facet, gg>
compute_layout: function
draw_back: function
draw_front: function
draw_labels: function
draw_panels: function
finish_data: function
init_scales: function
map_data: function
params: list
setup_data: function
setup_params: function
shrink: TRUE
train_scales: function
vars: function
super: <ggproto object: Class FacetNull, Facet, gg>
$ plot_env :<environment: R_GlobalEnv>
$ labels : Named list()
- attr(*, "class")= chr [1:2] "gg" "ggplot"
We now have a basic plot object initialized but we need to tell it how to display the data associated with it. We'll begin with a simple line graph of all the public health units across all dates within the set.
In order to update or add layers to a ggplot object, we use the + symbol for each command. For instance, to define the source of x-axis and y-axis data, we use the aes() command to update the aesthetics layer. Remember how we defined the public_health_unit variable as a factor? We'll take advantage of that here and tell ggplot to give each PHU its own colour.
After defining our aesthetics, we still need to tell ggplot how to actually graph the data. The ggplot2 package comes with an abundance of visualizations accessed through the geom_*() commands. Some examples include:

- geom_point() for scatterplots
- geom_line() for line graphs
- geom_boxplot() for boxplots
- geom_violin() for violin plots
- geom_bar() for bar graphs
- geom_histogram() for histograms

# Update the aesthetics with axis and colour information, then add a line graph!
phu.plot +
aes(x = ..., y = ..., colour = ...) +
geom_line() +
guides(colour = guide_legend(title="Public Health Unit")) # Legend title
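If you're coding along, one possible completion of the blanks above, using the same mapping that appears in the faceted version later in the talk:

```r
# One possible completion: date on x, new cases on y, one colour per PHU
phu.plot +
  aes(x = date, y = new_cases, colour = public_health_unit) +
  geom_line() +
  guides(colour = guide_legend(title = "Public Health Unit")) # Legend title
```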
facet_wrap() command to break PHUs into separate graphs

There's a lot of data on that graph, and some of it is quite drowned out by the scale of PHUs with many more cases. To break out each PHU individually, we can add the facet_wrap() command. We'll also update some of its parameters:

- scales: set this so each y-axis scale is determined by PHU-specific data.
- ncol: set the number of columns displayed in our grid.

At the same time, we'll get rid of the legend since each individual graph will be labeled by its PHU.
# This is going to be a big graph so adjust our plot window sizes for us
options(repr.plot.width=20, repr.plot.height=30)
# Add a facet_wrap and get rid of the legend
phu_facet <- phu.plot +
aes(x = date, y = new_cases, colour = public_health_unit) +
geom_line() +
facet_wrap(~ ..., scales = "free_y", ncol=4) +
theme(legend.position = "none")
phu_facet
ggsave() command to save your plots to a file

There are a number of ways you can use the ggsave() command to specify how you want to save your files.
ggsave(plot = phu_facet, filename = "Ontario_phu_data.all.facet.png", scale=2, device = "png", units = c("cm"))
Although we do have a running total for each date, what if we want to look at total cases across subsets of the PHUs? Using a bar plot we can stack cases by date and get a sense of daily case totals from whichever sets of PHUs we choose.
This time we will use geom_bar() to display our data and tell it to use the values from our new_cases variable to generate the totals. We do this by setting the stat = "identity" parameter.
At the same time, let's update our colours to use a colour-blind friendly palette scheme.
# This is going to be a simpler graph so adjust our plot window size accordingly
options(repr.plot.width=20, repr.plot.height=10)
phu.plot +
aes(x = date, y= new_cases, fill = public_health_unit) + # set our fill colour instead of line colour
... +
guides(fill = guide_legend(title="Public Health Unit")) +
scale_fill_viridis_d() # the "d" stands for discrete
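For reference, one way to fill in the blank above is with a geom_bar() layer. Setting stat = "identity" tells geom_bar() to use the y values as given rather than counting rows:

```r
# One possible completion: stacked bars of daily new cases per PHU
phu.plot +
  aes(x = date, y = new_cases, fill = public_health_unit) +
  geom_bar(stat = "identity") + # use new_cases values directly
  guides(fill = guide_legend(title = "Public Health Unit")) +
  scale_fill_viridis_d()
```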
From above we get a sense of overall totals for some PHU distributions but it's still too much to look at. Let's transform our x-axis values so we can bin by months instead. To accomplish this we'll use the as.yearmon() function found in the zoo package we loaded at the beginning of the talk.
phu.plot +
aes(x = ..., y = new_cases, fill = public_health_unit) +
geom_bar(stat="identity") +
guides(fill = guide_legend(title="Public Health Unit")) +
scale_fill_viridis_d() +
xlab("Date") +
ylab("New cases") +
ggtitle("New cases per month across Ontario Public Health Units") +
theme(text = element_text(size = 20)) # set text size
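One possible fill for the x-axis blank above, matching the as.yearmon() usage that appears later in the talk:

```r
# Bin dates by month with zoo's as.yearmon() on the x-axis
phu.plot +
  aes(x = as.yearmon(date), y = new_cases, fill = public_health_unit) +
  geom_bar(stat = "identity")
```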
Now that we have taken an initial look at our data, we can see that even after converting our axis to a month-year format, it appears that some of the data isn't that relevant for us. Some of the PHUs are not generating many new cases per day so we can now consider slicing our data up to look at specific regions.
Let's look at the top 10 regions by total caseload across the dataset.
# What are the top 10 regions by total caseload?
covid_phu_long.df %>%
# group the data by public health unit
group_by(public_health_unit) %>%
# Summarize it by the total number of new cases in each PHU
summarise(total_cases = sum(new_cases)) %>%
# Sort all of the data in descending order by total cases
arrange(desc(total_cases)) %>%
# take the top 10 PHUs
.[1:10, ]
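As an aside, dplyr (1.0 and newer, an assumption about your installed version) offers slice_max(), which combines the arrange-and-subset steps above into one call:

```r
# Equivalent top-10 with slice_max(): rows with the largest total_cases,
# returned in descending order
covid_phu_long.df %>%
  group_by(public_health_unit) %>%
  summarise(total_cases = sum(new_cases)) %>%
  slice_max(total_cases, n = 10)
```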
# Generate a list of all PHUs and sort by total caseload
phu_by_total_cases_desc <- covid_phu_long.df %>%
# Group by public health unit
group_by(public_health_unit) %>%
# Based on public health unit, sum the total cases
summarise(total_cases = sum(new_cases)) %>%
# Sort by descending order
arrange(desc(total_cases)) %>%
# Grab the PHU names and convert them into a character vector
select(public_health_unit) %>%
unlist() %>% as.character()
# Display the PHU names
print(phu_by_total_cases_desc)
filter() command to make a subset of our data

Now that we have a list of PHUs ordered by descending total cases, we can use it to filter our covid_phu_long.df data frame and graph only the more heavily infected PHUs. We can then pipe the filtered data frame over to make a ggplot() object. At the same time we'll do a few more things:
# Make a bar graph
covid_phu_long.df %>%
# Filter our data based on the PHUs we want to see
filter(...) %>%
# Redirect our new data frame to ggplot
ggplot(.) +
aes(x = as.yearmon(date), y = new_cases, fill = fct_reorder(public_health_unit, new_cases)) +
geom_bar(stat="identity") +
guides(fill = guide_legend(title="Public Health Unit")) +
scale_fill_viridis_d()+
xlab("Date") +
ylab("New cases") +
ggtitle("New cases per day across top 3 Ontario Public Health Units") +
theme(text = element_text(size = 20)) # set text size
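If you're coding along, one way to fill the filter() blank (the plot title suggests the top 3 PHUs):

```r
# One possible completion: keep the top 3 PHUs from our sorted vector
covid_phu_long.df %>%
  filter(public_health_unit %in% phu_by_total_cases_desc[1:3]) %>%
  head()
```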
We can see from our first graph of daily caseloads that there can be quite a bit of variability from day to day. Rather than look at the daily tally of new cases, perhaps we can take into account the overall number of new cases appearing in a 14-day sliding window. Given that symptoms can take between 5 and 14 days from the time of infection to manifest, a portion of daily positive cases can be the result of infections going back as far as 14 days. Looking at a 14-day window will also smooth out our data as a line graph.
To accomplish this we'll need to perform some transformations on our dataset.
We'll want to track observations by:
# Shut down some output information from the summarise function
options(dplyr.summarise.inform = FALSE)
# 1. group our data by public health unit
covid_phu_long.df <- covid_phu_long.df %>% group_by(public_health_unit)
# 2. get a complete list of case dates
case.dates <- unique(covid_phu_long.df$date)
# 3. set up a table to hold our summarised results
phu_window_data.df = ... (public_health_unit = character(0),
window_mean = numeric(0),
start_date = numeric(0), end_date = numeric(0))
# Iterate through the dates in a 14-day sliding window
for (i in 1:(length(case.dates)-13)) {
curr.set <- covid_phu_long.df %>%
# Filter for a set of data that spans 14 days
filter(date %in% case.dates[i:(i+13)]) %>%
# Summarize that data based on public health unit
summarize(window_mean = mean(new_cases))
# Track the start and end dates of the window
curr.set$start_date = case.dates[i]
curr.set$end_date = case.dates[i+13]
# Add this table to the collected data
phu_window_data.df <- rbind(phu_window_data.df, curr.set)
}
# Check on the final structure of the data
str(phu_window_data.df)
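The loop above works, but as an alternative sketch, the zoo package (which we loaded earlier for as.yearmon()) provides rollmean() for rolling means. This assumes there is exactly one row per date within each PHU and that the dates are complete:

```r
# Alternative: left-aligned 14-day rolling mean per PHU with zoo::rollmean();
# fill = NA pads the last 13 rows so the column length matches the data
phu_window_alt.df <- covid_phu_long.df %>%
  group_by(public_health_unit) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(window_mean = zoo::rollmean(new_cases, k = 14,
                                     fill = NA, align = "left")) %>%
  ungroup()
```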
Now that we've generated our windowed data, let's plot the top 5 PHUs by caseload. Let's also annotate some dates from the 2020 pandemic history:
# Build our plot and save to an object
phu_window.plot <- phu_window_data.df %>%
# Filter for the top 5 infected PHUs
filter(public_health_unit %in% phu_by_total_cases_desc[1:5]) %>%
# redirect the filtered result to ggplot
ggplot() +
aes(x = ..., y = ..., colour = fct_reorder(public_health_unit, window_mean, .desc=TRUE)) +
geom_line(size=2) +
scale_color_viridis_d() +
theme_bw() + # Simplify the theme
xlab("Date") +
ylab("Mean cases in 14-day window") +
ggtitle("Mean cases in a 14-day window across top 5 Ontario Public Health Units") +
guides(colour = guide_legend(title="Public Health Unit")) + # set our legend name
theme(panel.grid.major.y = element_line(color="grey95")) + # darken our major y grid
theme(panel.grid.minor.y = element_blank()) + # remove our minor y grid
theme(panel.grid.minor.x = element_blank()) + # remove our minor x grid
theme(text = element_text(size = 20)) + # set text size
# Start looking at data from July 2020 onwards
scale_x_date(limits = c(as.Date(...), as.Date(max(phu_window_data.df$start_date))),
date_breaks = "1 month", date_labels = "%b-%Y") +
# Annotate windows of various milestones
geom_text(aes(x=as.Date("2020-07-31") +7 , label = "Toronto enters stage 3", y=500), angle=90, size=10, colour="black") +
annotate("rect", xmin=as.Date("2020-07-31"), xmax=as.Date("2020-07-31")+14, ymin=-Inf, ymax=Inf, fill="grey", alpha=0.2) +
geom_text(aes(x=as.Date("2020-09-15") + 7, label = "School starts", y=500), angle=90, size=10, colour="black") +
annotate("rect", xmin=as.Date("2020-09-15"), xmax=as.Date("2020-09-15")+14, ymin=-Inf, ymax=Inf, fill="orange", alpha=0.2) +
geom_text(aes(x=as.Date("2020-12-26") + 7, label = "Province-wide lockdown", y=500), angle=90, size=10, colour="black") +
annotate("rect", xmin=as.Date("2020-12-26"), xmax=as.Date("2020-12-26")+14, ymin=-Inf, ymax=Inf, fill="red", alpha=0.2)
# plot our object to standard output
phu_window.plot
Today we've covered just a small example of how to import, format, and visualize data from outside sources. There are, however, a number of visualizations and avenues we haven't explored. Some other popular visualizations to consider
There are more advanced packages that simplify things like
Making interactive web apps through the R Shiny package!
Web scraping for the most popular pandemic sourdough bread recipes!
There are a number of potential COVID-19 data resources out there but here are some comprehensive ones related to Canada and beyond
The province of Ontario makes summaries of its COVID-19 case data available here
A talented graduate student, Jean-Paul R. Soucy cofounded the COVID-19 Canada Open Data Working Group and maintains a GitHub repository of COVID-19 statistics used by the Canadian Federal government. It is updated on a daily basis. His homepage is great too!
The National Center for Biotechnology Information (NCBI) is the definitive source for curated genomic DNA from all areas of life. It also has a dedicated SARS-CoV-2 portal for accessing all of its resources (sequencing data, publications etc.) on SARS-CoV-2.
The R programming language provides a great framework for data analysis. There are pre-existing packages that can facilitate your analysis of biological data and powerful tools for the statistical analysis and modeling of the growing datasets generated by the pandemic.
We've only had time to scrape the surface of what this language can do for you but you now also have a platform for practicing and growing your skills in this language!
I've included a large appendix covering some extra examples and finer details that we had to forego in this presentation where you can explore the syntax and language of R as well as another example of data cleanup and visualization with real-world COVID-19 data.
You can also find a code-complete version of this talk in HTML format here
Let's switch gears and take a look at another dataset from Ontario. Rather than breaking down cases by public health unit, this tracks total cases across Ontario along with different categories such as hospitalizations, long-term care facilities, and some of the more recent variants.
Looking closely at the variant information, it appears that Ontario began tracking variant data on 2021-01-29. Let's build a dataset from that point onwards.
# Open your dataset
covid_cases.df <- read_csv("./data/Ontario_covidtesting.csv")
str(covid_cases.df)
rename_with()!

As you can see, all of the column names are problematic to work with. We should replace all of the white-space characters and dashes with the underscore character. The R interpreter hates white space... and this will make working with the column names much easier for us.
# rename the columns to remove the spaces
covid_cases.df <- covid_cases.df %>%
# remove the spaces AND dashes and replace them
rename_with(., ~ tolower(gsub("\\s|-", "_", .x)))
print(colnames(covid_cases.df))
The data provided by the Ontario government represents a daily cumulative tally of the 3 variants of concern: B.1.1.7, B.1.351, and P.1. We want to convert these numbers into a daily incidence value. We'll be going through a large number of transformations to accomplish this.
At the same time we want to figure out how many new cases are being reported daily. We can use the diff() function on a vector to subtract neighbouring elements from each other, and we'll take advantage of that with the total_cases column.
The GH variant represents the dominant strain of SARS-CoV-2 in North America for most of 2020. From our calculation of daily new cases, we can also estimate the number of GH variant cases by subtracting the other variants reported on that day.
After generating those 4 values for each day, we'll convert the table to a long format so we can graph our data.
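Before the full pipeline, a quick illustration of how diff() turns a cumulative tally into daily counts (the cumulative vector here is a made-up toy example):

```r
# diff() subtracts each element from its successor, returning n - 1 values
cumulative <- c(10, 14, 21, 21, 30)
diff(cumulative)          # 4 7 0 9
# Pad with NA so the result lines up with the original rows
c(NA, diff(cumulative))   # NA 4 7 0 9
```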
# Filter data to when we begin tracking variants
variant_cases.df <- covid_cases.df %>%
filter(.[,1] >= "2021-01-28") %>% # go back an extra day so we can calculate daily cases
mutate(daily_cases = c(NA, diff(total_cases))) %>% # How many cases per day?
select(c(1, 20:23)) %>% # Make a new dataframe from columns with variant information
replace(., is.na(.), 0) %>% # replace NA values
# Rename the columns for simplicity by their variant
rename_with(., ~ toupper(str_replace_all(.x, pattern="total_lineage_",
replacement = "")),
contains("lineage")) %>%
# calculate the daily new cases for each variant.
# We use the NA as a placeholder because it generates one less value
# than the number of rows in our dataframe
mutate(B.1.1.7 = c(NA, diff(.$B.1.1.7)),
B.1.351 = c(NA, diff(.$B.1.351)),
P.1 = c(NA, diff(.$P.1))) %>%
# Set the dataframe to rowwise calculations
rowwise() %>%
# Sum across rows to calculate the main-strain values
mutate(GH_variant = daily_cases - sum(B.1.1.7, B.1.351, P.1, na.rm=TRUE)) %>%
# reset the dataframe
ungroup() %>%
# drop the first row which has NA placeholder data
slice(-1) %>%
# Convert the whole table to long format
pivot_longer(cols = c(2:4, 6), names_to="variant", values_to="new_cases") %>%
# Make sure the variant variable is a factor
mutate(variant = factor(variant))
head(variant_cases.df)
Since the numbers of variant cases are on very different scales, it's better if we facet by variant and look at the daily case numbers.
# Plot our variant data
variants.plot <- variant_cases.df %>%
ggplot(.) +
aes(x = reported_date, y = new_cases, colour = fct_reorder(variant, new_cases, .desc=TRUE)) +
geom_line(size=2) +
scale_color_viridis_d() +
theme_bw() + # Simplify the theme
xlab("Date") +
ylab("New cases") +
ggtitle("Daily variant cases reported across Ontario in 2021") +
guides(colour = guide_legend(title="Variant")) + # set our legend name
theme(panel.grid.major.y = element_line(color="grey95")) + # darken our major y grid
theme(panel.grid.minor.y = element_blank()) + # remove our minor y grid
theme(panel.grid.minor.x = element_blank()) + # remove our minor x grid
theme(text = element_text(size = 20)) + # set text size
# Label our data weekly
scale_x_date(date_breaks = "1 week", date_labels = "%b-%d") +
facet_wrap(~variant, scales="free_y")
# plot our object to standard output
variants.plot
Looks like there is a lot of variation in the day-to-day reporting. This could be a result of how/when samples are sent for variant testing. Much like how we smoothed out our PHU data, using the mean case number across a sliding window should help; maybe a 7-day window will work here.
We'll just recycle some of our code from earlier.
# Generate a 7-day average window to smooth out our data
# 1. group our data by variant
variant_cases.df <- variant_cases.df %>% group_by(variant)
# 2. get a complete list of case dates
case.dates <- unique(variant_cases.df$reported_date)
# 3. set up a table to hold our summarised results
variant_window_data.df = data.frame(variant = character(0),
window_mean = numeric(0),
start_date = numeric(0), end_date = numeric(0))
# Iterate through the dates in a 7-day sliding window
for (i in 1:(length(case.dates)-6)) {
curr.set <- variant_cases.df %>%
# Filter for a set of data that spans 7 days
filter(reported_date %in% case.dates[i:(i+6)]) %>%
# Summarize that data based on variant
summarize(window_mean = mean(new_cases))
# Track the start and end dates of the window
curr.set$start_date = case.dates[i]
curr.set$end_date = case.dates[i+6]
# Add this table to the collected data
variant_window_data.df <- rbind(variant_window_data.df, curr.set)
}
# Check on the final structure of the data
head(variant_window_data.df)
We'll wrap up this example with the smoothed 7-day window data plotted as a line graph and we'll fill the empty space below the line graph with bars representing the 7-day average data as well.
# Plot our variant data
variants_window.plot <- variant_window_data.df %>%
ggplot(.) +
aes(x = start_date, y = window_mean, colour = fct_reorder(variant, window_mean, .desc=TRUE)) +
geom_line(size=2) +
scale_color_viridis_d() +
geom_bar(stat = "identity", aes(fill=fct_reorder(variant, window_mean, .desc=TRUE)), alpha=0.4) +
scale_fill_viridis_d() +
theme_bw() + # Simplify the theme
xlab("Date") +
ylab("Mean cases in 7-day window") +
ggtitle("Variant cases reported in 2021 across Ontario, 7-day average") +
guides(fill = guide_legend(title="Variant")) + # set our fill legend name
guides(colour = FALSE) + # remove our colour legend as it will be redundant
theme(panel.grid.major.y = element_line(color="grey95")) + # darken our major y grid
theme(panel.grid.minor.y = element_blank()) + # remove our minor y grid
theme(panel.grid.minor.x = element_blank()) + # remove our minor x grid
theme(text = element_text(size = 20)) + # set text size
# Start looking at data from July 2020 onwards
scale_x_date(date_breaks = "1 week", date_labels = "%b-%d") +
facet_wrap(~variant, scales="free_y")
# plot our object to standard output
variants_window.plot
Let's discuss some important behaviours before we begin coding:
Commenting your code with #

Why bother?
Your worst collaborator is potentially you in 6 days or 6 months. Do you remember what you had for breakfast last Tuesday?
You can annotate your code for selfish reasons, or altruistic reasons, but annotate your code.
How do I start?
It is, in general, part of best coding practices to keep things tidy and organized.
A hash-tag # will comment your text. Inside a code cell in a Jupyter Notebook or anywhere in an R script, all text after a hashtag will be ignored by R and by many other programming languages. It's very useful to add comments about changes in your code, as well as detailed explanations about your scripts.
Put a description of what you are doing near your code at every process, decision point, or non-default argument in a function. For example, why you selected k=6 for an analysis, or the Spearman over Pearson option for your correlation matrix, or quantile over median normalization, or why you made the decision to filter out certain samples.
Break your code into sections to make it readable. Scripts are just a series of steps and major steps should be titled/outlined with your reasoning - much like when presenting your research.
Give your objects informative object names that are not the same as function names.
Comments may/should appear in three places:
# At the beginning of the script, describing the purpose of your script and what you are trying to solve
bedmasAnswer <- 5 + 4 * 6 - 0 # In line: describing a part of your code whose purpose is not obvious
Maintaining well-documented code is also good for mental health!
Basically, you have the following options:
The most important aspects of naming conventions are being concise and consistent!
Use version control.
For more information on best coding practices, please visit swcarpentry
We all run into problems. We'll see a lot of mistakes happen in class too! That's OK if we can learn from our errors and quickly (or eventually) recover.
- Lost? Use getwd() to check where you are working, type list.files() or use the Files pane to check that your file exists there, and setwd() to change your directory if necessary. Preferably, work inside an R project with all project-related files in that same folder. Your working directory will be set automatically when you open the project (this can be done by using File -> New Project and following the prompts).
- Use typeof() and class() to check what type of data you have. Use str() to peek at your data structures if you're making assumptions about them.
- Check a function's documentation with help("function"), ?function (using the name of the function that you want to check), or help(package = "package_name").
- Load a package with library("package_name"). If you only need one function from a package, or need to specify which package a function belongs to because functions in different packages share a name, you can use a double colon, i.e. package_name::function_name.
- A session aborted message can happen for a variety of reasons, like not having enough computational power to perform a task, or because of a system-wide failure. You will need to rerun your previous cells!
- When asking for help, including the program, version, error, package and function helps; be specific. Sometimes it is useful to include your operating system and version (Windows 10, Ubuntu 18, Mac OS 10, etc.).
You may run into assignment questions where the tools I've provided in lecture are not enough to reproduce the example output exactly as provided. If you wish to go that extra mile you may need to look for answers elsewhere by consulting references from the class or searching for it yourself.
Remember: Everyone looks for help online ALL THE TIME. It is very common. Also, with programming there are multiple ways to come up with an answer, even different packages that let you do the same thing in different ways. You will work on refining these aspects of your code as you go along in this course and in your coding career.
Last but not least, to make life easier: Under the Help pane, there is a cheatsheet of Jupyter notebook keyboard shortcuts or a browser list here.
Lists can hold mixed data types of different lengths. These are especially useful for bundling data of different types for passing around your scripts, to functions, or receiving output from functions! Rather than having to call multiple variables by name, you can store them in a single list!
If you forget what is in your list, use the str() function to check out its structure. It will tell you the number of items in your list and their data types.
# Make a list of various items
mixed.list <- list(countries = character.vector, values = numeric.vector, mixed.data = mixed.df)
# Look at some information about our list
str(mixed.list)
names(mixed.list)
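To see why lists are handy for function output, here is a minimal sketch; summarise_vec is a hypothetical helper, not part of the lecture datasets:

```r
# A function can hand back several results at once by bundling them in a list
summarise_vec <- function(x) {
  list(mean = mean(x), range = range(x), n = length(x))
}

res <- summarise_vec(c(4, 8, 15, 16, 23, 42))
res$mean   # 18
res$range  # 4 42
res$n      # 6
```

Rather than returning three separate variables, the caller receives one object and pulls out the pieces by name.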
Accessing lists is much like opening up a box of boxes of chocolates. You never know what you're gonna get when you forget the structure!
You can access elements with a mixture of number and naming annotations, much like data frames. [[x]] accesses the xth "element" of the list.
[x] returns a list object with your element(s) of choice in the list. [[x]] returns a single element only.
# Subset our list with []
mixed.list[c(1,3,2)]
mixed.list["values"]
# Pull out a single element
mixed.list[[2]]
mixed.list[["countries"]]
# Give a vector as input to [[]]
mixed.list[[c(1,3)]]
mixed.list[[c(3,1,1)]]
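What a vector inside [[ ]] does is recursive indexing, which is easiest to see with a small self-contained list:

```r
# [[ ]] with a vector indexes recursively: l[[c(i, j)]] is l[[i]][[j]]
l <- list(a = c(10, 20, 30), b = list(x = "deep"))

l[[c(1, 3)]]  # 30, the 3rd element of l$a
l[[c(2, 1)]]  # "deep", i.e. l$b$x
```

This is why `mixed.list[[c(1,3)]]` above does not return elements 1 and 3 of the list, the way `mixed.list[c(1,3)]` would.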
You can specify which columns of strings are converted to factors at the time of declaring your column information. Alternatively you can coerce character vectors to factors after generating them.
R by default puts factor levels in alphabetical order. This can cause problems if we aren't aware of it. You can check the order of your factor levels with the levels() command. Furthermore, you can specify your level order during factor creation.
Always check to make sure your factor levels are what you expect.
With factors, we can deal with our character levels directly, or their numeric equivalents.
# Generate a data frame and include factors
str(data.frame(country = character.vector,
               values = numeric.vector[2:4],
               commonwealth = logical.vector[1:3],
               continent = factor(c("North America", "North America", "Europe"),
                                  levels = c("North America", "Europe"))
               )
    )
# Coerce a factor
mixed.df <- data.frame(country = character.vector,
                       values = numeric.vector[2:4],
                       commonwealth = logical.vector[1:3],
                       continent = c("North America", "North America", "Europe"))
mixed.df$continent <- factor(mixed.df$continent, levels=c("North America", "Europe"))
str(mixed.df)
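The level-ordering behaviour can be sketched with a tiny self-contained factor (sizes is an illustrative example, not one of the lecture variables):

```r
# Declare the level order explicitly at creation time
sizes <- factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"))

levels(sizes)      # "small" "medium" "large"
as.integer(sizes)  # 1 3 2 -- the integer codes underlying each value

# relevel() moves a chosen level to the front (the reference level)
sizes2 <- relevel(sizes, ref = "medium")
levels(sizes2)     # "medium" "small" "large"
```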
Some useful tools for working with factors:
- levels() lists the levels and their order for your factor.
- relevel() moves a chosen level to the front.
- ordered = TRUE creates an ordered factor.
- labels = c() renames the levels. Note that level order is assigned before labels are added to your data. You are essentially labeling the integer assigned to your factor levels, so be careful when using this parameter!
Yes, you can treat data frames and arrays like large lists where mathematical operations can be applied to individual elements, to entire columns, or more!
Therefore be careful to specify your numeric data for mathematical operations.
mixed.df
mixed.df$values + 3
mixed.df$values * 4
# implicit coercion of logical to integer
mixed.df$commonwealth * 5
# Perform math on a factor (returns NA with a warning)
mixed.df$continent * 6
# Convert the factor to its numeric level codes first
as.numeric(mixed.df$continent) * 7
# Can we perform math on non-numeric variables?
#mixed.df$country + 8
Using the apply() family of functions to perform actions across data structures
The above are illustrative examples to see how our different data structures behave. In reality, you will want to do calculations across rows and columns, not on your entire matrix or data frame.
The apply() function will recognize basic functions and use them on vectorized data
For example, we might have a count table where rows are genes and columns are samples, and we want to know the sum of all the counts for a gene. To do this, we can use the apply() function. apply() takes an array or matrix (or something that can be coerced to one, like a numeric data frame) and applies a function over rows (MARGIN = 1) or columns (MARGIN = 2). Here we can invoke the sum function.
# Make a sample data frame of numeric values only
numeric.df <- data.frame(geneA = numeric.vector, geneB = numeric.vector*2, geneC = numeric.vector*3)
str(numeric.df)
# Apply sum by columns
apply(numeric.df, 2, sum)
# Apply sum by rows
apply(numeric.df, 1, sum)
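For the common cases of summing across rows or columns, base R also ships purpose-built shortcuts that give the same answers as the apply() calls above:

```r
# colSums() and rowSums() are equivalent to apply() with sum
m <- matrix(1:6, nrow = 2)

all(colSums(m) == apply(m, 2, sum))  # TRUE
all(rowSums(m) == apply(m, 1, sum))  # TRUE
```

These (along with colMeans() and rowMeans()) are optimized and usually faster on large matrices.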
The rest of the apply() family
There are 3 additional members of the apply() family that perform similar functions with varying outputs:
- lapply(data, FUN, ...) is usable on data frames, lists, and vectors. It returns a list as output. FUN is applied to each element in turn, and any additional arguments in ... are passed along to FUN.
- sapply(data, FUN, ...) works similarly to lapply() except it tries to simplify the output to the most elementary data structure possible, i.e. it will return the simplest form of the data that makes sense as a representation.
- mapply(FUN, data, ...) is short for "multivariate" apply; it applies a function to multiple lists or multiple vector arguments.
# Use lapply on the columns of numeric.df
lapply(numeric.df, sum)
str(lapply(numeric.df, sum))
# Use sapply on the columns of numeric.df
sapply(numeric.df, sum)
str(sapply(numeric.df, sum))
# Use lapply and sapply with sum on an actual list
sum.list <- list(numeric.vector, numeric.df)
# lapply on the list
lapply(sum.list, sum)
# sapply on the list
sapply(sum.list, sum)
# Use lapply to select portions from a list
sum.list <- list(numeric.df, numeric.df)
# Extract the first row from each member of the list
lapply(sum.list, "[", 1,)
# Extract the 2nd column from each member of the list
lapply(sum.list, "[", , 2)
# Take a close look at what sapply returns in this case
sapply(sum.list, "[", , 2)
# Use mapply in an example on numeric.vector
mapply(sum, numeric.vector, numeric.vector)
# Use mapply on the rep function to see its output
mapply(rep, c("repeat", "this", "phrase"), 4)
For this introductory course we will be teaching and running code for R through Jupyter notebooks. In this section we will discuss
As of 2021-01-18, the latest version of Anaconda3 runs with Python 3.8.
Download the OS-appropriate version from here https://www.anaconda.com/products/individual
All versions should come with Python 3.8
Windows:
MacOS:
Unix:
As of 2020-12-11, the latest version of r-base available for Anaconda is 4.0.3, but Anaconda comes pre-installed with R 3.6.1. To save time, we will update just our r-base version through the command line using the Anaconda prompt. You'll need to find the menu shortcut to the prompt in order to run these commands. Before class you should update all of your Anaconda packages; this will ensure you get the latest version of Jupyter notebook. Open up the Anaconda prompt and type the following command:
conda update --all
It will ask permission to continue at some point. Say 'yes' to this. After this is completed, use the following command:
conda install -c conda-forge/label/main r-base=4.0.3=hddad469_3
Anaconda will try to install a number of R-related packages. Say 'yes' to this.
Lastly, we want to connect your R version to the Jupyter notebook itself. Type the following command:
conda install -c r r-irkernel
Jupyter should now have R integrated into it. No need to build an extra environment to run it.
You may find that for some reason or another, you'd like to maintain a specific R-environment (or other) to work in. Environments in Anaconda work like isolated sandbox versions of Anaconda within Anaconda. When you generate an environment for the first time, it will draw all of its packages and information from the base version of Anaconda - kind of like making a copy. You can also create these in the Anaconda prompt. You can even create new environments based on specific versions or installations of other programs. For instance, we could have tried to make an environment for R 4.0.3 with the command
conda create -n my_R_env -c conda-forge/label/main r-base=4.0.3=hddad469_3
This would create a new environment with version 4.0.3 of R but the base version of Anaconda would retain version 3.6.1 of R. A small but helpful detail if you are unsure about newer versions of packages that you'd like to use.
Likewise, you can update and install packages in new environments without affecting or altering your base environment! Again it's helpful if you're upgrading or installing new packages and programs. If you're not sure how it will affect what you already have in place, you can just install them straight into an environment.
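As a sketch, a typical environment lifecycle at the Anaconda prompt might look like the following (my_R_env is an arbitrary name, and the r-base build string matches the one used earlier):

```shell
# Create an isolated environment with its own copy of R
conda create -n my_R_env -c conda-forge/label/main r-base=4.0.3=hddad469_3

# Step into it; installs now affect only this environment
conda activate my_R_env
conda install -c r r-irkernel

# Step back out to the base environment
conda deactivate

# List all environments, or delete one you no longer need
conda env list
conda env remove -n my_R_env
```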
For more information: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#cloning-an-environment
If you are inclined, the Anaconda Navigator can help you make an R environment separate from the base, but you won't be able to perform the same fancy tricks as in the prompt, like installing new packages directly to a new environment.
Note: You should consider doing this only if you have a good reason to isolate what you're doing in R from the Anaconda base packages. You will also need to have installed r-base 4.0.3 to make a new environment with it through the Anaconda navigator.
The Anaconda navigator is a graphical interface that shows all of your pre-installed packages and gives you access to installing other common programs like RStudio (we'll get to that in a moment).
You will now have an R environment where you can install specific R packages that won't make their way into your Anaconda base.
You will likely find a shortcut to this environment in your (Windows) menu under the Anaconda folder. It will look something like Jupyter Notebook (R-4-0-3)
Normally I suggest avoiding installing packages through your Jupyter Notebook. Instead, if you want to update your R packages for running Jupyter, it's best to add them through either the Anaconda prompt or Anaconda navigator. Again, using the prompt gives you more options but can seem a little more complicated.
One of the most useful packages to install for R is r-essentials. Open up the Anaconda prompt and use the command:
conda install -c r r-essentials
After running, the Anaconda prompt will inform you of any package dependencies and identify which packages will be updated, newly installed, or removed (unlikely).
Anaconda has multiple channels (similar to repositories) that exist and are maintained by different groups. These channels port regular R packages to a format that can be installed in Anaconda and run by R. The two main channels you'll find useful are the r channel and the conda-forge channel. You can find more information about all of the packages on docs.anaconda.com. As you might have guessed, the basic format for installing packages is conda install -c channel-name r-package, where
- conda install is the call to install packages. This can be done in a base or custom environment.
- -c channel-name identifies that you wish to name a specific channel to install from.
- r-package is the name of your package; most package names begin with r-, i.e. r-ggplot2.
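Putting the pieces together, here is what a concrete install might look like (ggplot2 chosen purely as an example; it is available on both channels):

```shell
# Install ggplot2 from the r channel...
conda install -c r r-ggplot2

# ...or the same package from conda-forge
conda install -c conda-forge r-ggplot2
```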
As of 2020-06-25, the latest stable R version is 4.0.3:
Windows:
- Go to <http://cran.utstat.utoronto.ca/>
- Click on 'Download R for Windows'
- Click on 'install R for the first time'
- Click on 'Download R 4.0.3 for Windows' (or a newer version)
- Double-click on the .exe file once it has downloaded and follow the instructions.
(Mac) OS X:
- Go to <http://cran.utstat.utoronto.ca/>
- Click on 'Download R for (Mac) OS X'
- Click on R-4.0.3.pkg (or a newer version)
- Open the .pkg file once it has downloaded and follow the instructions.
Linux:
- Open a terminal (Ctrl + alt + t)
- sudo apt-get update
- sudo apt-get install r-base
- sudo apt-get install r-base-dev (so you can compile packages from source)
As of 2021-01-18, the latest RStudio version is 1.4.1103
Windows:
- Go to <https://www.rstudio.com/products/rstudio/download/#download>
- Click on 'RStudio 1.3.1093 - Windows Vista/7/8/10' to download the installer (or a newer version)
- Double-click on the .exe file once it has downloaded and follow the instructions.
(Mac) OS X:
- Go to <https://www.rstudio.com/products/rstudio/download/#download>
- Click on 'RStudio 1.3.1093 - Mac OS X 10.13+ (64-bit)' to download the installer (or a newer version)
- Double-click on the .dmg file once it has downloaded and follow the instructions.
Linux:
- Go to <https://www.rstudio.com/products/rstudio/download/#download>
- Click on the installer that describes your Linux distribution, e.g. 'RStudio 1.3.1093 - Ubuntu 18/Debian 10(64-bit)' (or a newer version)
- Double-click on the .deb file once it has downloaded and follow the instructions.
- If double-clicking on your .deb file did not open the software manager, open the terminal (Ctrl + alt + t) and type **sudo dpkg -i /path/to/installer/rstudio-xenial-1.3.959-amd64.deb**
_Note: You have 3 things that could change in this last command._
1. This assumes you have just opened the terminal and are in your home directory. (If not, you have to modify your path. You can get to your home directory by typing cd ~.)
2. This assumes you have downloaded the .deb file to Downloads. (If you downloaded the file somewhere else, you have to change the path to the file, or download the .deb file to Downloads).
3. This assumes your file name for .deb is the same as above. (Put the name matching the .deb file you downloaded).
If you have a problem with installing R or RStudio, you can also try to solve the problem yourself by Googling any error messages you get. You can also try to get in touch with me or the course TAs.
RStudio is an IDE (Integrated Development Environment) for R that provides a more user-friendly experience than using R in a terminal setting. It has 4 main areas or panes, which you can customize to some extent under Tools > Global Options > Pane Layout:
All of the panes can be minimized or maximized using the large and small box outlines in the top right of each pane.
The Source is where you keep the code and annotation that you want saved as your script. The tab at the top left of the pane has your script name (i.e. 'Untitled.R'), and you can switch between scripts by toggling the tabs. You can save, search, or publish your source code using the buttons along the pane header. Code in the Source pane is not run or executed automatically.
To run your current line of code or a highlighted segment of code from the Source pane you can:
a) click the button 'Run' -> 'Run Selected Line(s)',
b) click 'Code' -> 'Run Selected Line(s)' from the menu bar,
c) use the keyboard shortcut CTRL + ENTER (Windows & Linux) or Command + ENTER (Mac) (recommended),
d) copy and paste your code into the Console and hit Enter (not recommended).
There are always many ways to do things in R, but the fastest way will always be the option that keeps your hands on the keyboard.
You can also type and execute your code (by hitting ENTER) in the Console when the > prompt is visible. If you enter code and you see a + instead of a prompt, R doesn't think you are finished entering code (i.e. you might be missing a bracket). If this isn't immediately fixable, you can hit Esc twice to get back to your prompt. Using the up and down arrow keys, you can find previous commands in the Console if you want to rerun code or fix an error resulting from a typo.
On the Console tab in the top left of that pane is your current working directory. Pressing the arrow next to your working directory will open your current folder in the Files pane. If you find your Console is getting too cluttered, selecting the broom icon in that pane will clear it for you. The Console also shows information about R upon startup (such as the version number), during the installation of packages, when there are warnings, and when there are errors.
In the Global Environment you can see all of the stored objects you have created or sourced (imported from another script). The Global Environment can become cluttered, so it also has a broom button to clear its workspace.
Objects are made by using the assignment operator <-. On the left side of the arrow, you have the name of your object. On the right side you have what you are assigning to that object. In this sense, you can think of an object as a container. The container holds the values given as well as information about 'class' and 'methods' (which we will come back to).
Type x <- c(2,4) in the Console followed by Enter. 1D objects' data types can be seen immediately as well as their first few values. Now type y <- data.frame(numbers = c(1,2,3), letters = c("a","b","c")) in the Console followed by Enter. You can immediately see the dimension of 2D objects, and you can check the structure of data frames and lists (more later) by clicking on the object's arrow. Clicking on the object name will open the object to view in a new tab. Custom functions created in session or sourced will also appear in this pane.
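To recap the Console examples above as runnable script code:

```r
# Objects are containers: name on the left of <-, value on the right
x <- c(2, 4)
y <- data.frame(numbers = c(1, 2, 3), letters = c("a", "b", "c"))

class(x)  # "numeric" -- a 1D atomic vector
dim(y)    # 3 2 -- rows and columns of the 2D object
str(y)    # the same structure summary the Environment pane's arrow shows
```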
The Environment pane dropdown displays all of the currently loaded packages in addition to the Global Environment. Loaded means that all of the tools/functions in the package are available for use. R comes with a number of packages pre-loaded (e.g. base, grDevices).
In the History tab are all of the commands you have executed in the Console during your session. You can select a line of code and send it to the Source or Console.
The Connections tab is to connect to data sources such as Spark and will not be used in this lesson.
The Files tab allows you to search through directories; you can go to or set your working directory by making the appropriate selection under the More (blue gear) drop-down menu. The ... to the top left of the pane allows you to search for a folder in a more traditional manner.
The Plots tab is where plots you make in a .R script will appear (notebooks and markdown plots will be shown in the Source pane). There is the option to Export and save these plots manually.
The Packages tab has all of the packages that are installed and their versions, and buttons to Install or Update packages. A check mark in the box next to the package means that the package is loaded. You can load a package by adding a check mark next to a package, however it is good practice to instead load the package in your script to aid in reproducibility.
The Help menu has the documentation for all packages and functions. For each function you will find a description of what the function does, the arguments it takes, what the function does to the inputs (details), what it outputs, and an example. Some of the help documentation is difficult to read or less than comprehensive, in which case googling the function is a good idea.
The Viewer will display vignettes, or local web content such as a Shiny app, interactive graphs, or a rendered html document.
I suggest you take a look at Tools -> Global Options to customize your experience.
For example, under Code -> Editing I have selected Soft-wrap R source files followed by Apply so that my text will wrap by itself when I am typing and not create a long line of text.
You may also want to change the Appearance of your code. I like the RStudio theme: Modern and Editor font: Ubuntu Mono, but pick whatever you like! Again, you need to hit Apply to make changes.
That whirlwind tour isn't everything the IDE can do, but it is enough to get started.